Arango in Python¶

To work with Arango in Python, we can use the python-arango package (install it with pip install python-arango, then import arango).

In [ ]:

# We'll import networkx and matplotlib for some light visualizations
import json
import networkx as nx
import matplotlib.pyplot as plt

from arango import ArangoClient

Connecting¶

As with most databases, we first need to connect to the server. With Arango, we do this through arango.ArangoClient.

We need to provide a hosts URL made up of:

protocol - 'http'
host - URI/IP address
port - 8529 by default

Once connected, we can access a given database by passing its name and our credentials to client.db()

Finally, we can also connect to a graph, by name, using db.graph()

In [ ]:

# Our client connection
client = ArangoClient(hosts='http://18.219.151.47:8529')

# Our database connection
db = client.db('emse6586', username='root', password='emse6586pass')

AQL Queries¶

Once connected to a database, we can query the database using AQL, simply by executing it.

db.aql.execute(AQL_query)

In [ ]:

query = """FOR tweet IN statuses
            LIMIT 10
            RETURN tweet"""
results = db.aql.execute(query)
print(results)
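
When part of a query comes from a Python variable, AQL bind parameters keep the value out of the query string. The following is a minimal sketch, assuming the same statuses collection as above; the helper name fetch_statuses is hypothetical, and db is expected to be the database handle created earlier.

```python
def fetch_statuses(db, limit=10):
    """Return up to `limit` documents from the statuses collection.

    `db` is assumed to be a python-arango database handle (anything that
    exposes `aql.execute(query, bind_vars=...)` works). The function name
    is illustrative and not part of python-arango.
    """
    query = """
        FOR tweet IN statuses
            LIMIT @limit
            RETURN tweet
    """
    # @limit is filled in from bind_vars rather than by string formatting
    cursor = db.aql.execute(query, bind_vars={'limit': limit})
    return list(cursor)
```

With the connection above, fetch_statuses(db, 10) should return the same ten documents as the cell it follows.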

Arango Cursor¶

Like other connections, we get back cursor objects from our executed queries. We can iterate over them the same way we do with other cursors.

In [ ]:

tweets = list(results)
print(tweets[0])
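
Calling list() pulls every remaining result into memory. Because the cursor is iterable, a few rows can also be peeked at lazily. This is a small sketch using only the standard library; the helper name take is hypothetical, and a plain iterator stands in for the Arango cursor so the example runs without a server.

```python
from itertools import islice

def take(cursor, n):
    """Return at most the first `n` items from any iterable cursor."""
    return list(islice(cursor, n))

# A plain iterator stands in for an Arango cursor in this sketch
fake_cursor = iter([{'_key': 'a'}, {'_key': 'b'}, {'_key': 'c'}])
first_two = take(fake_cursor, 2)
print(first_two)
```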

Given the dictionary-like structure of the objects, they can be easily loaded into a DataFrame

In [ ]:

import pandas as pd
df = pd.DataFrame(tweets)
df.head()

Arango Traversals¶

Given that Arango is built around traversing graphs and graph-like queries, python-arango provides a streamlined traversal API. However, the simplified API requires a graph to have been defined within Arango.

If you recall from the lecture, we have one called twitter_sphere which we can hook into. The graph can be initialized by db.graph({graph_name})

In [ ]:

# Our graph connection
graph = db.graph('twitter_sphere')

results = graph.traverse(max_depth=2, direction='any', start_vertex='users/22203756', vertex_uniqueness='global')
print(results.keys())

The main difference with this result set is the format of the data structure. Given that it is a graph traversal, there are three different sub-elements:

paths
vertices
edges

Each of these elements has its own structure that mirrors those we saw when reviewing graph traversals.

In [ ]:

print(f'Number of paths: {len(results["paths"])}')
print(f'Number of vertices: {len(results["vertices"])}')

In [ ]:

print(json.dumps(results['paths'][4000], indent=2))

In [ ]:

for edge in results['paths'][4000]['edges']:
    print(edge)

Traversals return vertices, paths, and edges (depending on the traversal)
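
To make the nesting concrete, the sketch below walks a tiny hand-written result shaped like the paths printed above. The sample documents and the helper name path_summary are made up for illustration; only the 'paths', 'vertices', 'edges', '_id', '_from', and '_to' keys come from the cells above.

```python
def path_summary(path):
    """Describe one path as 'start -> ... -> end' using vertex _id values."""
    return ' -> '.join(vertex['_id'] for vertex in path['vertices'])

# Hand-written stand-in for a traversal result (not real data)
sample = {
    'vertices': [{'_id': 'users/1'}, {'_id': 'users/2'}],
    'paths': [
        {'vertices': [{'_id': 'users/1'}], 'edges': []},
        {
            'vertices': [{'_id': 'users/1'}, {'_id': 'users/2'}],
            'edges': [{'_from': 'users/1', '_to': 'users/2'}],
        },
    ],
}

for path in sample['paths']:
    print(len(path['edges']), 'edge(s):', path_summary(path))
```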

NetworkX Integration¶

Using networkx and matplotlib, we can actually plot some of these graphs

In [ ]:

def populate_from_query(results, G, limit=100):
    """Populate a networkx graph from traversal results.

    Args:
        results (dict) - Traversal results containing a 'paths' list
        G (networkx.Graph) - The networkx graph to add edges to
        limit (int) - Maximum number of edges to add (-1 for no limit)
    """

    def label(node):
        # Prefer a readable label; fall back to the document _id
        return node.get('screen_name') or node.get('status_id') or node['_id']

    edge_count = 0
    for result in results['paths']:

        # Map each vertex _id on this path to its display label
        labels = {node['_id']: label(node) for node in result['vertices']}

        for edge in result['edges']:
            if limit != -1 and edge_count >= limit:
                return
            if edge_count % 100 == 0:
                print(f'{edge_count} edges added')
            # Fall back to the raw _id if a vertex is missing from the path
            from_node = labels.get(edge['_from'], edge['_from'])
            to_node = labels.get(edge['_to'], edge['_to'])
            G.add_edge(from_node, to_node)
            edge_count += 1

In [ ]:

fig, ax = plt.subplots(1, 1, figsize=(16, 14))

G = nx.Graph()
populate_from_query(results, G, 25)

pos = nx.spring_layout(G, k=.01)
nx.draw_networkx_nodes(G, pos, ax=ax, node_color='red', alpha=0.7, node_size=500)
nx.draw_networkx_edges(G, pos, ax=ax, edge_color='gray', alpha=0.5)

# Raise the text positions so each label sits just above its node
label_pos = {node: (x, y + 0.07) for node, (x, y) in pos.items()}
nx.draw_networkx_labels(G, label_pos, ax=ax, font_weight='bold', font_size=12, font_color='black')
#nx.draw(G, pos, font_size=16, with_labels=True)

In-Class Work¶

Let's apply this kind of logic to solve a more interesting, and complex, problem.

In our twitter data we'll focus on two people, and the relationships that connect them together:

Elon Musk
Tim Cook

Writing Our Query¶

To start we need to write our query to identify the paths that link our two people together.

We'll focus on just friendships and only allow our traversal to search to a depth of 2, meaning that only one intermediary friend may link our two people together.

start_node (Elon) = users/44196397
end_node (Tim) = tim_cook
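
One possible shape for such a query is sketched below, using AQL's graph traversal syntax with bind parameters. It assumes the target can be matched on a screen_name attribute (the same attribute populate_from_query uses for labels) and that the twitter_sphere graph is used. Restricting the traversal to friendship edges would need a filter or edge collection that is not named here, so that part is left out. The helper name build_link_query is hypothetical.

```python
def build_link_query(start_vertex, target_screen_name, depth=2):
    """Build an AQL traversal and its bind variables (illustrative only).

    Assumes the target vertex carries a `screen_name` attribute and that
    the named graph `twitter_sphere` exists, as in the cells above.
    """
    depth = int(depth)  # inline the depth as a plain integer
    query = f"""
        FOR v, e, p IN {depth}..{depth} ANY @start GRAPH 'twitter_sphere'
            FILTER v.screen_name == @target
            RETURN p
    """
    bind_vars = {'start': start_vertex, 'target': target_screen_name}
    return query, bind_vars

query, bind_vars = build_link_query('users/44196397', 'tim_cook')
print(query)
# With a live connection this could be run as:
# paths = list(db.aql.execute(query, bind_vars=bind_vars))
```

Each returned path carries its own vertices and edges lists, which is the shape populate_from_query expects inside a 'paths' list.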

In [ ]:

# Space for the query

Cleaning the results¶

Because our data treats friendship as a single direction, we can end up with paths that touch the same vertices (while technically being separate paths). So let's clean it up to see how many actual intermediate people connect our two users.

Identify the unique vertices for the query:
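
A minimal sketch of that clean-up follows, assuming each path is a dict with a 'vertices' list ordered from the start node to the end node, so that everything between the first and last entry is an intermediary. The sample paths and the helper name unique_intermediaries are invented for illustration.

```python
def unique_intermediaries(paths):
    """Collect the distinct vertices strictly between each path's endpoints."""
    seen = {}
    for path in paths:
        for vertex in path['vertices'][1:-1]:
            # Key on _id so the same person reached twice is counted once
            seen.setdefault(vertex['_id'], vertex)
    return list(seen.values())

# Made-up paths: two of them pass through the same intermediary
sample_paths = [
    {'vertices': [{'_id': 'users/a'}, {'_id': 'users/m'}, {'_id': 'users/z'}]},
    {'vertices': [{'_id': 'users/a'}, {'_id': 'users/m'}, {'_id': 'users/z'}]},
    {'vertices': [{'_id': 'users/a'}, {'_id': 'users/n'}, {'_id': 'users/z'}]},
]
middle = unique_intermediaries(sample_paths)
print(len(middle), [v['_id'] for v in middle])
```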

In [ ]:

# Space to identify unique connections

Plotting Our Data¶

Now let's plot the data to see if our findings are corroborated by the graph's visualization. We can hook into the previously created populate_from_query function; however, you will need to reshape the results for it to work.

In [ ]:

# Space for our graph
fig, ax = plt.subplots(1, 1, figsize=(14, 8));

G = nx.Graph(ax=ax)
populate_from_query({'paths': paths}, G, 150)

pos = nx.spring_layout(G, k=.3)
nx.draw_networkx_nodes(G, pos, node_color='red', alpha=0.7, node_size=500)
nx.draw_networkx_edges(G, pos, edge_color='gray', alpha=0.5)
nx.draw_networkx_labels(G, pos, font_weight='bold', font_size=12, font_color='black')
#nx.draw(G, pos, font_size=16, with_labels=True)
for p in pos:  # raise text positions
    pos[p][1] += 0.07

How About in SQL¶

How would we do this same analysis using SQL?

Do you think that this would be harder or easier?

Expanding the Concept¶

Looking at only one jump wasn't too complicated when jumping back and forth between SQL and Arango (graphs), but how about expanding to allow two jumps?

Change our AQL to search for the connections between Trump and The Rock with 1 to 2 intermediary friends:

In [ ]:

# Updated query

Scaling Solutions¶

How does changing AQL compare to what we would have to do to change the SQL query?

In [ ]:

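-- Handles one intermediary friend
SELECT * FROM friends
    WHERE friend_from = 'tim_cook'
    AND friend_to IN (SELECT friend_from FROM friends WHERE friend_to = 'elonmusk');

-- Handles two intermediary friends: each extra jump needs another nested subquery
SELECT * FROM friends
    WHERE friend_from = 'tim_cook'
    AND friend_to IN (
        SELECT friend_from FROM friends
        WHERE friend_to IN (SELECT friend_from FROM friends WHERE friend_to = 'elonmusk'));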
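
To try the nested-subquery approach end to end, the sketch below loads a few made-up rows into an in-memory SQLite table and runs the one- and two-intermediary lookups. The friends table, its friend_from/friend_to columns, and the sample names follow the SQL above and are assumptions about the schema, not the course database.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE friends (friend_from TEXT, friend_to TEXT)')
conn.executemany(
    'INSERT INTO friends VALUES (?, ?)',
    [
        # tim_cook -> alice -> elonmusk (one intermediary)
        ('tim_cook', 'alice'), ('alice', 'elonmusk'),
        # tim_cook -> bob -> carol -> elonmusk (two intermediaries)
        ('tim_cook', 'bob'), ('bob', 'carol'), ('carol', 'elonmusk'),
    ],
)

one_hop = """
    SELECT friend_to FROM friends
    WHERE friend_from = 'tim_cook'
      AND friend_to IN (SELECT friend_from FROM friends WHERE friend_to = 'elonmusk')
"""

# Each additional intermediary adds another level of nesting
two_hop = """
    SELECT friend_to FROM friends
    WHERE friend_from = 'tim_cook'
      AND friend_to IN (
          SELECT friend_from FROM friends
          WHERE friend_to IN (SELECT friend_from FROM friends WHERE friend_to = 'elonmusk'))
"""

one_hop_friends = [row[0] for row in conn.execute(one_hop)]
two_hop_friends = [row[0] for row in conn.execute(two_hop)]
print(one_hop_friends, two_hop_friends)
```

In SQL every extra jump changes the structure of the query, whereas the AQL traversal only needs its depth range adjusted.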